hw1
desriptive statistics
probability
First homeowrk on descriptive statistics and probability
Author

Shoshana Buck

Published

October 3, 2022

Code
#install libraries
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(readxl)
library(ggplot2)

Question 1

Use the LungCapData to answer the following questions.

Code
# read in data
lung_cap<- read_xls("_data/LungCapData.xls")
lung_cap
# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows

a. What does the distribution of LungCap look like?

Code
#plot histogram
hist(lung_cap$LungCap)

The histogram shows that the distribution is pretty close to a normal distribution, with an almost a bell shaped curve. Meaning that the data near the mean are more of a frequent occurrence which is true because there are fewer observations near the margins.

b. compare the probability distribution of the LungCap with respect to Males and Females?

Code
#Box plot
ggplot(lung_cap, aes (Gender,LungCap)) + geom_boxplot()

The box plot shows that male’s have a slightly higher lung capacity than females. Female’s have more values in the first quartile range and a lower minimum value than male’s. On the other hand male’s have a higher max value and more values in the 3rd quartile range.

c. Compare the mean lung capacities for smokers and non-smokers. Does it make sense?

Code
lung_cap %>% 
  group_by(Smoke) %>% 
  summarise(avg_lung_cap = mean(LungCap))
# A tibble: 2 × 2
  Smoke avg_lung_cap
  <chr>        <dbl>
1 no            7.77
2 yes           8.65

This does not make sense. Smokers have a higher sample mean than non-smokers which intuitively does not make sense because we would assume non-smokers would have a higher lung capacity.

d. Examine the relationship between Smoking and Lung Capacity within age groups: “less than or equal to 13”, “14 to 15”, “16 to 17”, and “greater than or equal to 18”.

Code
#categorical variable of age_groups 
df<- lung_cap %>% 
group_by(Smoke,LungCap) %>% 
  summarise(age_group = case_when(Age<=13 ~ "13 & under",Age ==14 | Age == 15 ~ "14 to 15",Age== 16 | Age == 17 ~ "16 to 17", Age>=18 ~ "18 & older")) 
`summarise()` has grouped output by 'Smoke', 'LungCap'. You can override using
the `.groups` argument.
Code
#mean of lung capacity with new variable
df %>% 
  group_by(Smoke, age_group) %>% 
  summarise(avg_lung_cap = mean(LungCap)) %>% 
  arrange(desc(avg_lung_cap))
`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups:   Smoke [2]
  Smoke age_group  avg_lung_cap
  <chr> <chr>             <dbl>
1 no    18 & older        11.1 
2 yes   18 & older        10.5 
3 no    16 to 17          10.5 
4 yes   16 to 17           9.38
5 no    14 to 15           9.14
6 yes   14 to 15           8.39
7 yes   13 & under         7.20
8 no    13 & under         6.36
Code
#histogram
ggplot(df, aes(x = LungCap)) + facet_grid(Smoke ~age_group) +geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Using the package ggplot I used the function facet_grids to show a good visualization of the lung capacity between non-smokers and smokers within each age group. Looking at the histograms all age_groups that are non-smokers have a higher sample mean proving that non-smokers have

e. Compare the lung capacities for smokers and non-smokers within each age group. Is your answer different from the one in part d. What could possibly be going on here?

Code
# Mean of non-smokers 13 & younger
df %>% 
  filter(Smoke == 'no' & age_group == '13 & under') %>% 
  pull(LungCap) %>% 
  mean()
[1] 6.358746
Code
# Mean of smokers 13 & younger
df %>% 
  filter(Smoke == 'yes' & age_group == '13 &under') %>% 
  pull(LungCap) %>% 
  mean()
[1] NaN
Code
#Mean of non-smokers 14 to 15
df %>% 
  filter(Smoke == 'no' & age_group == '14 to 15') %>% 
  pull(LungCap) %>% 
  mean()
[1] 9.13881
Code
#Mean of smokers 14 to 15
df %>% 
  filter(Smoke == 'yes' & age_group == '14 to 15') %>% 
  pull(LungCap) %>% 
  mean()
[1] 8.391667
Code
#Mean of non-smokers 16 to 17
df %>% 
  filter(Smoke == 'no' & age_group == '16 to 17') %>% 
  pull(LungCap) %>% 
  mean()
[1] 10.46981
Code
#Mean of smokers 16 to 17
df %>% 
  filter(Smoke == 'yes' & age_group == '16 to 17') %>% 
  pull(LungCap) %>% 
  mean()
[1] 9.38375
Code
#Mean of non-smokers 18 & older
df %>% 
  filter(Smoke == 'no' & age_group == '18 & older') %>% 
  pull(LungCap) %>% 
  mean()
[1] 11.06885
Code
# Mean of smokers 18 & older
df %>% 
  filter(Smoke == 'yes' & age_group == '18 & older') %>% 
  pull(LungCap) %>% 
  mean()
[1] 10.51333
Code
lung_cap
# A tibble: 725 × 6
   LungCap   Age Height Smoke Gender Caesarean
     <dbl> <dbl>  <dbl> <chr> <chr>  <chr>    
 1    6.48     6   62.1 no    male   no       
 2   10.1     18   74.7 yes   female no       
 3    9.55    16   69.7 no    female yes      
 4   11.1     14   71   no    male   no       
 5    4.8      5   56.9 no    male   no       
 6    6.22    11   58.7 no    female no       
 7    4.95     8   63.3 no    male   yes      
 8    7.32    11   70.4 no    male   no       
 9    8.88    15   70.5 no    male   no       
10    6.8     11   59.2 no    male   no       
# … with 715 more rows

f. Calculate the correlation and covariance between Lung Capacity and Age. (use the cov() and cor() functions in R). Interpret your results.

Code
#Correlation
cor(lung_cap$LungCap, lung_cap$Age)
[1] 0.8196749
Code
#Covariance
cov(lung_cap$LungCap, lung_cap$Age)
[1] 8.738289

The correlation is 0.81 which is pretty close to +1, meaning that the there is a positive relationship between lung capacity and age. The covariance is also high which shows that the two variables of lung capacity and age have a positive relationship. #2 Let X = number of prior convictions for prisoners at a state prison at which there are 810 prisoners.

Code
# x= prior convictions 
x<-c(0, 1, 2, 3, 4)
frequency<-c(128, 434, 160, 64, 24)
state_prison<- data_frame(x,frequency)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
ℹ Please use `tibble()` instead.
Code
state_prison
# A tibble: 5 × 2
      x frequency
  <dbl>     <dbl>
1     0       128
2     1       434
3     2       160
4     3        64
5     4        24

a. What is the probability that a randomly selected inmate has exactly 2 prior convictions?

Code
# 2 prior convictions.P(2)/total
160/810
[1] 0.1975309

There is a 1.9% probability that a randomly selected inmate has exactly 2 prior convictions.

b. What is the probability that a randomly selected inmate has fewer than 2 prior convictions?

Code
#  less than 2 prior convictions. (P(0) + P(1))/total 
(128+434)/810
[1] 0.6938272

There is a 6.9% probability that a randomly selected inmate has fewer than 2 prior convictions.

c. What is the probability that a randomly selected inmate has 2 or fewer prior convictions?

Code
# 2 or fewer prior convictions. (P(0) + P(1) + P(2)) +total
(128+434+160)/810
[1] 0.891358

There is a 8.9% probability that a randomly selected inmate has 2 or fewer prior convictions.

d. What is the probability that a randomly selected inmate has more than 2 prior convictions?

Code
# More than 2 prior convictions. (P(3) +P(4)) + total
(64+24)/810
[1] 0.108642

There is a 10.8% probability that a randomly selected inmate has more than 2 prior convictions.

e. What is the expected value for the number of prior convictions?

Code
#Prior convictions. ((0*P(0)) +(1*(P(1)) + (2*P(2)) + (3*P(3)) + (4*P(4)))
df<-((128*0/810) +(434*1/810) +(160*2/810) +(64*3/810) +(24*4/810)) 
mean(df)
[1] 1.28642

The expected value for number of prior convictions is 1.28

f. Calculate the variance and the standard deviation for the Prior Convictions.

Code
# variance
v<- ((0-1.28)^2) *(128/810) +((1-1.28)^2) * (434/810)+((2-1.28)^2) * (160/810)+((3-1.28)^2) *(64/810) +((4-1.28)^2) * (24/810)
v
[1] 0.8562765
Code
# standard deviation
sd<- sqrt(v)
sd
[1] 0.9253521

The variance is 0.856 and the standard deviation is 0.925.